This is my (Hi, I’m Eirik) project page for coursework related to the PHD-302 Introduction to Open Data Science course hosted at the University of Helsinki, autumn 2022.
Link for the GitHub repository: github.com/Eiriksen/IODS-project
Found out about it from the doctoral school’s mailing lists, and signed up because I need the study credits and hoped to learn more about general open data science concepts, especially data collaboration and using git. I am already very familiar with R, so I don’t expect to learn much new on the technical side, but I’m hoping to get some perspectives on how to use it in collaborative projects (where others have to read your code).
I mostly skimmed through these chapters and the exercises; I’ve already spent more hours than I dare to admit coding in R with tidyverse, dplyr, and ggplot, so I didn’t find much here that I wasn’t familiar with. About the book: I’m not sure it uses best practice for getting people started with R. It seems to be written for someone who has never touched R or any kind of coding/scripting before, and yet it completely skips the very basics of understanding how R works, instead having the user jump straight into RStudio and installing tidyverse, which I think might create a lot of confusion. Hopefully others attending this course have some experience with R beforehand.
library(tidyverse)
library(GGally)
tb_students <- read_delim("data/data-learning2014.txt")
str(tb_students)
## spec_tbl_df [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ gender : chr [1:166] "F" "M" "F" "M" ...
## $ age : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num [1:166] 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
## $ points : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
## - attr(*, "spec")=
## .. cols(
## .. gender = col_character(),
## .. age = col_double(),
## .. attitude = col_double(),
## .. deep = col_double(),
## .. stra = col_double(),
## .. surf = col_double(),
## .. points = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
The dataset is the result of a survey of participants in the course Johdatus yhteiskuntatilastotieteeseen (Introduction to Social Statistics), fall 2014, and surveys the participants’ approaches to learning (ASSIST, section B) and attitudes toward statistics (based on SATS, https://www.evaluationandstatistics.com/scoring), as well as their points from their exams.
Info about the study: Metadata 1 Metadata 2 PPT presentation
183 of the course’s 250 students participated in the study. However, due to some missing data, there are 166 participants in the final dataset.
As we can see from str(tb_students) above, the dataset has 166 observations of 7 variables. Each row is one participant, and for each participant we know their age (numeric), gender (F/M), exam points (numeric), and their scores (numeric) on four dimensions: attitude towards statistics (attitude), deep learning approach (deep), surface learning approach (surf), and strategic learning approach (stra). These scores are based on several questions in the questionnaire where the students were asked to rate (from 1 to 5) how much they agreed with certain statements, for example “I usually set out to understand for myself the meaning of what we have to learn”.
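As an illustration of how such composite scores work (with made-up ratings, not actual survey answers), a dimension score like deep is simply the mean of a participant’s 1-to-5 ratings on that dimension’s statements, so it stays on the same 1-to-5 scale:

```r
# Hypothetical ratings (1-5) for the statements of one dimension; not real answers
ratings_deep <- c(4, 3, 5, 4, 3, 4, 3, 4, 3, 4, 3, 4)
deep_score <- mean(ratings_deep)  # the composite stays on the same 1-5 scale
```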
Let’s have a quick look at some summaries of the data:
ggpairs(
  data = tb_students %>% relocate(-gender),
  lower = list(continuous = wrap("points", alpha = 0.3, size = 0.3), mapping = aes(color = gender))
)+
  scale_color_manual(values = c("F" = "RED", "M" = "BLUE"))+
  scale_fill_manual(values = c("F" = "RED", "M" = "BLUE"))
Blue represents male participants, and red female participants.
This plot shows us the distribution of the different variables (diagonal), the plotted relationships between the variables (lower triangle), and the measured correlations between them (upper triangle). Correlations are not shown for gender since it’s a categorical variable.
From the top left plot we can see that most of the participants are aged around 20-30 years old, but that there are some participants in the older age groups slightly above 50 years. From the other diagonals we can see that most variables are fairly normally distributed.
There are also far more female than male participants (bottom right plot); we can get some more details on that using a summary table:
tb_students %>% group_by(gender) %>% summarise(n=n(), p=n() / nrow(.))
## # A tibble: 2 × 3
## gender n p
## <chr> <int> <dbl>
## 1 F 110 0.663
## 2 M 56 0.337
This shows us that 66% (n=110) of the participants were female, and only 34% (n=56) male. However, looking at the rightmost column of the plot above, we can see that most variables don’t seem to associate strongly with gender, except perhaps attitude, where the male participants seem to score slightly higher. This can also be seen in the second plot of the bottom row, where the female participants appear to include a larger share with a low attitude score.
Looking at the correlations, the strongest one (***) seems to be between points and attitude: participants with a higher attitude score tend to score more points on the exam. Surface learning also seems to correlate negatively with deep learning and strategic learning, as well as with attitude.
Let’s fit a regression model to examine some of the patterns more closely. We’ll use the variables attitude, age, and gender.
model_full <- lm(points ~ attitude + age + gender, data=tb_students)
summary(model_full)
##
## Call:
## lm(formula = points ~ attitude + age + gender, data = tb_students)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.4590 -3.3221 0.2186 4.0247 10.4632
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.42910 2.29043 5.863 2.48e-08 ***
## attitude 0.36066 0.05932 6.080 8.34e-09 ***
## age -0.07586 0.05367 -1.414 0.159
## genderM -0.33054 0.91934 -0.360 0.720
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.315 on 162 degrees of freedom
## Multiple R-squared: 0.2018, Adjusted R-squared: 0.187
## F-statistic: 13.65 on 3 and 162 DF, p-value: 5.536e-08
It seems here that only attitude has a significant association with exam points, as the t-tests of the slope estimates only give a p-value below 0.05 for attitude. This statistical test checks whether the slope (estimate) of an explanatory variable is significantly different from 0: even if the true slope is 0, the estimated slope will randomly deviate from 0 following a t-distribution, so we can check whether our estimate differs from 0 more than is likely under that distribution.
In our case, only the slope for attitude is significantly different from zero; If age or gender has an effect on exam points, we’re not able to detect it here.
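To make this concrete, the t value and p-value that summary() reports for attitude can be recomputed by hand from the estimate and its standard error (numbers copied from the full-model summary above; df = 166 observations minus 4 estimated coefficients = 162):

```r
# Reproduce the t-test for the attitude slope from the summary table above:
# t = estimate / standard error, p = 2 * P(T < -|t|) with df = 162
estimate  <- 0.36066
std_error <- 0.05932
t_value <- estimate / std_error             # ~6.08, as reported
p_value <- 2 * pt(-abs(t_value), df = 162)  # ~8.3e-09, as reported
```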
We’ll simplify the model by removing the non-significant explanatory variables:
model_simple <- lm(points ~ attitude, data=tb_students)
summary(model_simple)
##
## Call:
## lm(formula = points ~ attitude, data = tb_students)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Our estimated slope for attitude is 0.35. This means that for each extra point on the attitude score, the expected number of exam points goes up by 0.35. The multiple R-squared value (coefficient of determination) of 0.19 tells us that our explanatory variable attitude explains 19% of the variation in the target variable, exam points.
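As a refresher on what R-squared measures, here is a minimal sketch (with simulated data, not the student dataset) verifying that multiple R-squared equals 1 minus the residual sum of squares over the total sum of squares:

```r
# Verify the R-squared identity on made-up data:
# R^2 = 1 - RSS/TSS = 1 - sum(residuals^2) / sum((y - mean(y))^2)
set.seed(1)
x <- rnorm(100)
y <- 10 + 0.35 * x + rnorm(100, sd = 5)
m <- lm(y ~ x)
r2_manual <- 1 - sum(residuals(m)^2) / sum((y - mean(y))^2)
# r2_manual matches the "Multiple R-squared" from summary(m)
```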
A key assumption for a basic linear model is that the residuals of the observed variable (points) are normally distributed.
There are many ways of checking this; two are shown below: 1) simply plot a histogram of the residuals, and 2) make a q-q plot. These two plots basically tell us the same thing, but the q-q plot makes it a little easier to see how much our residuals stray from normal.
par(mfrow=c(1,2))
hist(residuals(model_simple),breaks=50)
plot(model_simple,which=2)
Our histogram and Q-Q plot tell us that our residuals are fairly normally distributed, but there are some extreme values towards the negative end.
Side note: Personally, I initially found it a little tricky to wrap my head around what is meant by “standardised residuals” and “theoretical quantiles”, so I did some googling and tried making the q-q plot from scratch. I understand it now, but it’s actually quite tricky to explain in words. Anyways, here’s the R code for a q-q plot from scratch:
plot(qnorm(ppoints(nrow(tb_students))), sort(scale(residuals(model_simple))))
abline(coef=c(0,1))
Another important assumption of the linear model is homoscedasticity (as opposed to heteroscedasticity), which means that the variance of our residuals stays the same independent of the explanatory variable (there should not be higher variance among students with a high attitude score). We can check this with a residuals vs fitted plot:
par(mfrow=c(1,2))
plot(model_simple,which=1)
plot(tb_students$attitude, residuals(model_simple))
abline(coef=c(0,0))
The plot shows us that the residuals stay more or less the same independent of the “fitted values” (explanatory variable), which is good! It could look like there’s less variance among the students with a high attitude score, but this is likely just because there are fewer students there, so it’s probably not a problem.
Also, to help understand the content of the plot, I’ve plotted the same plot twice, once using the built-in residuals vs fitted plot (left), and once by making the plot from “scratch” (right).
Finally, we might want to check for extreme outliers. Outliers are data points with an extremely high residual value, while data points with high leverage are points that have a strong influence on the model. Points that have both a high residual and high leverage are potentially bad for the accuracy of our regression.
plot(model_simple,which=5)
In our case, it looks like there are no problematic points, as none of them are outside Cook’s distance, which is so far away here that it is not even shown on the plot.
Overall, it looks like our model fits reasonably well, although we should keep the negatively skewed residuals in mind.
library(tidyverse)
library(glue)
tb_alc <- read_delim("data/data-alc.txt")
print(names(tb_alc))
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures_m" "paid_m" "absences_m"
## [31] "G1_m" "G2_m" "G3_m" "failures_p" "paid_p"
## [36] "absences_p" "G1_p" "G2_p" "G3_p" "alc_use"
## [41] "high_use" "failures" "absences" "G1" "G2"
## [46] "G3"
The dataset is a product of questionnaires handed to students of two Portuguese schools, as well as course outcomes for two courses (math and Portuguese); G1, G2, and G3 refer to the grades in the first period, the second period, and the final grade. In this set, the values G1, G2, G3, failures, and absences are the averages of the scores for the two courses (for each student). The questionnaire questions relate to demographic, social, and school-related data. A full description of each variable and the dataset is available from this link. Of special interest in this analysis are the variables Walc and Dalc, which are scores from 1 (low) to 5 (very high) representing alcohol usage on weekends (Walc) and on workdays (Dalc).
In this dataset we have also defined two new variables: alc_use (the average of Walc and Dalc) and high_use (whether alc_use is higher than 2 or not).
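The two derived variables can be reconstructed with a couple of mutate calls; here is a sketch on a toy table (not the real data):

```r
library(dplyr)
# Toy table with made-up workday/weekend scores, not the real dataset
toy <- tibble(Dalc = c(1, 3, 5), Walc = c(2, 5, 5))
toy <- toy %>%
  mutate(
    alc_use  = (Dalc + Walc) / 2,  # average of workday and weekend use
    high_use = alc_use > 2         # TRUE when the average exceeds 2
  )
```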
Below I look at four variables that I think will have some relationship with high alcohol use.
goout (negative correlation) This variable represents how much students go out with friends. Now, this might fail miserably, but my hypothesis is that students who often go out with friends have a lower chance of being high-alcohol users; they might drink some when they are out with friends, but going out with friends often might also correlate with a stronger social network, which can be protective against high alcohol use. Hypothesis: lower high_use with higher goout.
sex (some correlation) A lot of traits and social aspects differ by sex in humans (whether by culture or biology), so I’ll hypothesise that the number of high-alcohol users is going to be different for the two sexes (not hypothesising about the direction of the relationship).
studytime (negative correlation) My guess is that high alcohol use is symptomatic of other problems in a person’s life; students in an overall bad situation might have a higher alcohol usage and also less focus on studying. My hypothesis is thus that a lot of study time gives a lower chance of high alcohol use.
famrel (negative correlation) Going by the same argument as above: family relations are important, and students with bad family relations might seek relief in alcohol usage. My hypothesis is that bad family relations increase the probability of high alcohol use.
One way to explore these data is to plot the proportion of high alcohol users for different levels of our variables in question (goout, sex, studytime, and famrel).
However, by only looking at the proportion, we’re missing out on some important information: how many students there are in each category, which tells us how much weight to give the proportion at each value.
To solve this, we’ll make plots showing both the proportion of high alcohol users for different levels of our variables, but also plot the distribution (histogram) of our variable in question.
To make this process a little cleaner, we’ll make a general function for making this plot (based on the variable in question, “var”):
# Set up a general function to make these plots
makeplot <- function(var){
  # Make a table where each row is one value of our explanatory variable...
  # ... and the column "p_high_use" gives the proportion of high alcohol users
  # ... in that group
  tb_alc_grouped <-
    tb_alc %>%
    group_by(!!var) %>%
    summarise(p_high_use = sum(high_use) / n())
  # Find the max y value for the histogram we're going to be plotting;
  # we'll use that as a scaling factor for the second y axis of the plot
  ymax <- tb_alc %>% group_by(!!var) %>% summarise(n = n()) %>% pull(n) %>% max()
  scale <- ymax
  # Make the plot, drawing both the bar chart of our variable in question (x)
  # and points showing the proportion of high alcohol users for each value of x
  ggplot()+
    labs(title = glue("Distribution (bars) of {as.character(var)[2]} and the proportion (line) of \n high alcohol users in each category"))+
    geom_bar(data = tb_alc, aes(x = !!var))+
    geom_point(data = tb_alc_grouped, aes(x = !!var, y = p_high_use * scale), col = "red", size = 3)+
    scale_y_continuous(
      name = "N (bars)", sec.axis = sec_axis(~ . / (scale / 100), name = "% High alcohol use (line)")
    )
}
Then we can make the plots:
makeplot(quo(goout))
This plot seems to go against my initial hypothesis: it seems like there’s a higher proportion of high alcohol users among those who go out a lot. There are quite a bit fewer students going out a lot (4-5) than going out less (1-3), but probably not few enough to drive any extreme relationships; the pattern is also quite strong. Based on this graph I’m guessing my hypothesis was not right, but let’s see what the analysis says later.
makeplot(quo(sex))
This plot seems to be in line with my hypothesis that the number of high alcohol users would differ between the sexes. It looks like there are more high alcohol users among the male students than among the female students. There are slightly more female students than male students, but the difference is not large.
makeplot(quo(studytime))
This also fits my hypothesis that students who spend a lot of time studying have low frequencies of high alcohol use. There are also relatively few students who study a lot.
makeplot(quo(famrel))
This one is interesting: it seems like students with poor family relations do have a higher incidence of high alcohol use, except when the family relations are really poor (famrel = 1). However, note that the sample sizes for famrel < 3 are very low, so it is possible that the small samples are driving more extreme (or “random”) values for the incidence of high alcohol use. I would therefore be a little sceptical of drawing conclusions from this graph.
We’ll start by fitting a generalized linear model using our variables and a logit link function (family = “binomial”). We’ll also print out a summary of the model.
model = glm( high_use ~ goout + sex + studytime + famrel, family="binomial", data=tb_alc)
To get a better idea of the parameter estimates and their confidence intervals, we’ll print them out after applying an exponential transformation (transforming log odds ratios to odds ratios):
exp(confint(model))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.06545486 1.1622100
## goout 1.73873662 2.8198312
## sexM 1.31841735 3.7630180
## studytime 0.43751933 0.8576353
## famrel 0.49791884 0.8636137
summary(model)[12]$coefficients %>% as.data.frame() %>% mutate(Estimate=exp(Estimate), `Pr(>|z|)`=round(`Pr(>|z|)`,5))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.2816141 0.7312170 -1.733025 0.08309
## goout 2.1974324 0.1230422 6.398534 0.00000
## sexM 2.2165443 0.2669277 2.981891 0.00286
## studytime 0.6179355 0.1710716 -2.813859 0.00490
## famrel 0.6575181 0.1399361 -2.996245 0.00273
It seems like all our selected variables associate significantly with high alcohol usage (at \(\alpha=0.05\)), since the p value for each variable is lower than 0.05.
Going through each variable:
Time spent going out. This variable associates positively (opposite of my hypothesis, oh well), so more time spent going out means a higher probability of high alcohol usage. Specifically, for each point (from 1 to 5, from very low to very high) on self-reported time spent going out, the students have 2.2 times [95% CI: 1.74-2.82] the odds of having a high alcohol consumption.
Sex. Male-identifying students have 2.2 times [95% CI: 1.32-3.76] the odds of having a high alcohol consumption compared to female-identifying students. Male students drink more than female students, supporting my hypothesis that male and female students have different levels of alcohol consumption.
Studytime. For each point (from 1 to 5, going from less than two hours to more than 10 hours) of self-reported time spent studying, the students have 0.62 times [95% CI: 0.44-0.86] the odds of having a high alcohol consumption. More study time = less alcohol (as hypothesised).
Family relationship. For each point (from 1 to 5, going from very bad to excellent) of self-reported family relationship quality, the students have 0.66 times [95% CI: 0.50-0.86] the odds of having a high alcohol consumption. Better family relationships = less alcohol (as hypothesised).
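To make the odds-ratio interpretation concrete, here is a small worked example with a made-up baseline probability of 0.2 (not a number taken from the model): multiplying the odds by 2.2, as one extra goout point does, moves the probability from 20% to roughly 35%:

```r
# Hypothetical baseline probability; not a quantity estimated by the model
p_base   <- 0.2
odds     <- p_base / (1 - p_base)      # 0.25
odds_new <- odds * 2.2                 # apply the goout odds ratio
p_new    <- odds_new / (1 + odds_new)  # back to a probability: ~0.355
```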
# Creating the predictions
tb_alc_pred <-
tb_alc %>%
mutate(
probability = predict(model,type="response"),
prediction = rbinom(n=n(),1,probability)
)
# Making the cross tabulation
crosstab <- tb_alc_pred %>%
group_by(high_use) %>%
summarise(
n_wrong = sum(high_use!=prediction),
n_right = sum(high_use==prediction)
)
# Calculating the percentage wrong predictions
p_wrong = sum( tb_alc_pred$high_use != tb_alc_pred$prediction ) / nrow(tb_alc_pred) * 100
| high_use | n_wrong | n_right |
|---|---|---|
| FALSE | 65 | 194 |
| TRUE | 63 | 48 |
The table above shows us the cross tabulation of our predictions vs the training data. We can see that for students who did not have a high alcohol usage (“FALSE”), the model predicted 194 of them correctly and 65 wrong. For those with a high alcohol usage (“TRUE”), our model predicted 48 right and 63 wrong. Our model is thus more accurate for those who do not have a high alcohol usage. Going by our calculation above (p_wrong), the total percentage of incorrectly predicted students was 34.59%.
We can also plot this:
# Plotting the predictions
ggplot(tb_alc_pred)+
aes(x=probability,y=high_use,col=factor(prediction))+
geom_point(alpha=0.5)
The plot shows each student as one dot; the upper line is those with a high alcohol usage, and the lower line those without. Red points are those predicted to not have a high usage, and blue points those predicted to have a high usage. The x-axis shows the estimated probability of high usage taken from the model.
As for the comparison with a simple guessing strategy: I found the question unclear and decided not to spend time on it, so I’m leaving this part blank.
# Setting up the loss function (same as in the exercise)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# Running a 10-fold cross validation
library(boot)
cv <- cv.glm(data = tb_alc, cost = loss_func, glmfit = model, K = 10)
Our cross validation tells us that our model had an error of 0.2459459, which is actually better than the one used in the exercise. Looks like we already found a model with better predictive accuracy.
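To double check what the loss function above is doing, here is a tiny made-up example: it returns the proportion of cases where the predicted probability lands on the wrong side of 0.5 (the definition is repeated so the snippet is self-contained):

```r
# Same loss function as above, repeated so this snippet runs on its own
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5  # TRUE when prob is on the wrong side of 0.5
  mean(n_wrong)
}
# Two of these four predicted probabilities land on the wrong side of 0.5
loss_func(class = c(0, 1, 1, 0), prob = c(0.2, 0.9, 0.3, 0.6))  # 0.5
```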
library(tidyverse)
library(GGally)
library(ggord)
tb_boston <- MASS::Boston
str(tb_boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
In the Boston dataset, each row is one town/suburb (total N=506) of the city of Boston, Massachusetts. The dataset contains various demographic, land-use, environmental, and infrastructure data. Full description. Some examples:
- crim: the per capita crime rate in the area
- nox: the nitrogen oxides concentration (parts per 10 million)
- rad: index of accessibility to radial highways
- ptratio: pupil-teacher ratio by town
etc…
color_correlation <- function(data, mapping, method="p", use="pairwise", ...){
# Function by user20650 on Stackoverflow (https://stackoverflow.com/a/53685979)
# grab data
x <- eval_data_col(data, mapping$x)
y <- eval_data_col(data, mapping$y)
# calculate correlation
corr <- cor(x, y, method=method, use=use)
# calculate colour based on correlation value
# Here I have set a correlation of minus one to blue,
# zero to white, and one to red
# Change this to suit: possibly extend to add as an argument of `color_correlation`
colFn <- colorRampPalette(c("blue", "white", "red"), interpolate ='spline')
fill <- colFn(100)[findInterval(corr, seq(-1, 1, length=100))]
ggally_cor(data = data, mapping = mapping, ...) +
theme_void() +
theme(panel.background = element_rect(fill=fill))
}
ggpairs(
  data = tb_boston,
  upper = list(continuous = color_correlation),
  lower = list(continuous = wrap("points", alpha = 0.3, size = 0.3))
)
That’s an overwhelming amount of correlating data! It’s almost easier to describe what’s not correlating than what is. I’m guessing that the data here must be a selection from some larger dataset where only the most interesting/correlated variables have been kept. We can also see that very few of the variables follow simple normal distributions, often having multiple peaks or being highly skewed either to the left or to the right.
# A function for randomizing the order of rows in a table
randomize_rows <- function(tb){
return(
tb %>%
mutate( order=sample(row_number()) ) %>%
arrange(order) %>%
select(-order)
)
}
# Standardize all variables,
# turn crim into a categorical variable (divided by quantiles into ranges of 25%),
# and randomize the order of the rows before cutting the dataset off into test and training sets later
tb_boston_st <- tb_boston %>%
mutate_all( ~ scale(.)[,1]) %>%
mutate(
crim = cut(crim,quantile(crim, probs=c(0,0.25,0.5,0.75,1)),labels=c("low","med_low","med_high","high"),include.lowest=T)
) %>%
randomize_rows()
cutoff_80 = round(nrow(tb_boston_st)*0.8)
tb_boston_st_train = tb_boston_st[1:cutoff_80,]
tb_boston_st_test = tb_boston_st[(cutoff_80+1):nrow(tb_boston_st),]
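The quantile-based cut used above splits the variable into four roughly equal-sized groups; a sketch on the toy vector 1:100 shows each label getting 25 values:

```r
# cut() with quantile breaks splits a variable into four equal-sized groups
x <- 1:100
x_cat <- cut(
  x,
  quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1)),
  labels = c("low", "med_low", "med_high", "high"),
  include.lowest = TRUE
)
table(x_cat)  # 25 values per category
```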
ggpairs(
  data = tb_boston_st,
  upper = list(continuous = color_correlation),
  lower = list(continuous = wrap("points", alpha = 0.3, size = 0.3))
)
After standardizing all the columns in the dataset, each variable is centered on zero, and its value is expressed in standard deviations. Because of this, it’s not easy to see the changes visually in the figure above; the only place you can really see the change is by looking at the axis values.
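As a quick check of what the standardization does, scale() centres a variable on 0 with standard deviation 1 (toy vector, not the Boston data):

```r
# After scale(), a variable has mean 0 and standard deviation 1
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
z <- scale(x)[, 1]  # [, 1] drops the matrix attributes, as in the pipeline above
```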
lindis <- MASS::lda(crim~.,tb_boston_st_train)
print(lindis)
## Call:
## lda(crim ~ ., data = tb_boston_st_train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2493827 0.2567901 0.2493827 0.2444444
##
## Group means:
## zn indus chas nox rm age
## low 0.95848346 -0.9120632 -0.155385496 -0.8801019 0.4360067 -0.8861534
## med_low -0.05084164 -0.3270113 -0.007331936 -0.5550686 -0.1126280 -0.2880256
## med_high -0.38450487 0.1956216 0.234426408 0.4083171 0.1100684 0.4386633
## high -0.48724019 1.0149946 -0.033716932 1.0029027 -0.3300043 0.7987064
## dis rad tax ptratio black lstat
## low 0.8918545 -0.6941859 -0.7348002 -0.40889554 0.37716249 -0.75509392
## med_low 0.3577450 -0.5412574 -0.4289550 -0.07406251 0.34724241 -0.15041474
## med_high -0.3706090 -0.3917183 -0.2922618 -0.24242667 0.07338671 -0.01214478
## high -0.8519107 1.6596029 1.5294129 0.80577843 -0.67353177 0.81979542
## medv
## low 0.493468462
## med_low -0.008480782
## med_high 0.144671627
## high -0.612454341
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.12829236 0.63508634 -0.97236755
## indus -0.03160969 -0.30313962 0.09422943
## chas -0.05490313 -0.05442929 0.16113030
## nox 0.26629485 -0.83859719 -1.39150979
## rm -0.08935639 -0.10749564 -0.21416707
## age 0.37934876 -0.34881102 0.01593860
## dis -0.10428172 -0.27682460 0.14065211
## rad 3.20556311 0.78754290 -0.29645558
## tax 0.00665294 0.20367841 0.87506554
## ptratio 0.15000802 0.03150162 -0.35185687
## black -0.16940755 0.01699321 0.17140568
## lstat 0.15158170 -0.14044517 0.38292359
## medv 0.17152741 -0.32651576 -0.17254934
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9475 0.0396 0.0130
Here, the linear discriminant analysis has reduced our dataset down to three dimensions, LD1, LD2, and LD3, where most of the variance is explained by LD1. We can also plot our points against these dimensions:
cols <- c("cadetblue1","green","lightgoldenrod","orange")
ggord(lindis, tb_boston_st_train$crim, axes=c("1","2"), cols=cols )
ggord(lindis, tb_boston_st_train$crim, axes=c("1","3"), cols=cols )
ggord(lindis, tb_boston_st_train$crim, axes=c("2","3"), cols=cols )
In terms of studying the crime in these areas, it seems like the first linear discriminant, LD1, is the most interesting one, as it is the only one really separating the high-crime areas from the areas with lower crime. It also seems like the variable “rad” is the one most strongly associated with LD1; does high access to highways associate with high crime rates?
Maybe living areas close to the highways are less desirable, so high-resource residents live further away from highways, and that demographic associates with lower crime.
# Making predictions
crim_predicted <- predict(lindis, newdata = tb_boston_st_test %>% select(-crim))
# Creating a cross tabulation
table(correct = tb_boston_st_test$crim, predicted = crim_predicted$class)
## predicted
## correct low med_low med_high high
## low 13 11 2 0
## med_low 4 14 4 0
## med_high 0 10 15 0
## high 0 0 1 27
Seems like our model got quite a few predictions correct (the diagonal values). But what’s the percentage of correct guesses?
# Create alternative cross tabulation
tb_boston_st_test %>%
mutate(
predicted = crim_predicted$class
) %>%
group_by(crim) %>%
summarise(n=n(), n_pred_correct=sum(crim==predicted), p_pred_correct = n_pred_correct/n*100)
## # A tibble: 4 × 4
## crim n n_pred_correct p_pred_correct
## <fct> <int> <int> <dbl>
## 1 low 26 13 50
## 2 med_low 22 14 63.6
## 3 med_high 25 15 60
## 4 high 28 27 96.4
So, our model got roughly 50-64% correct guesses for the categories low, med_low, and med_high, but quite a lot right for the high-crime areas (96.4% correct)!
dist_eu = dist(tb_boston_st, method="euclidean")
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1394 3.5267 4.9081 4.9394 6.2421 13.0045
The median Euclidean distance is 4.9, and the max distance is 13.0.
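As a reminder of what dist() computes, here is a minimal check against the Euclidean formula on two toy points, (0,0) and (3,4):

```r
# dist() with method "euclidean" returns the square root of the summed squared
# differences; for (0,0) and (3,4) that is sqrt(3^2 + 4^2) = 5
m <- rbind(c(0, 0), c(3, 4))
d <- as.matrix(dist(m, method = "euclidean"))[1, 2]  # 5
```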
Next, let’s try clustering our data. First, we’ll figure out what number of clusters is sensible to use. In the code below, we compare the total within-cluster sum of squares for different numbers of clusters. As an additional detail, I’ve done the same calculation 10 times over and then taken the mean for each choice of cluster number; this reduces the random variation in the graph and smoothens the curve a bit.
v_clusters = rep(1:15,10)
v_wcss <- sapply(v_clusters, function(k){kmeans(tb_boston %>% mutate_all(~scale(.)[,1]), k)$tot.withinss})
tb_kmeans <- data.frame(clusters = v_clusters, wcss=v_wcss) %>%
group_by(clusters) %>%
summarise(wcss=mean(wcss))
ggplot(tb_kmeans)+aes(x=clusters,y=wcss)+geom_point()
Seems like our biggest drop in total within-cluster sum of squares comes when going from 1 to 2 clusters; further additions of clusters still improve the sum of squares, but not nearly as much.
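As a sanity check of the quantity being plotted: with a single cluster, the total within-cluster sum of squares is simply the total sum of squares around the column means (sketched on toy data, unrelated to the Boston set):

```r
# With k = 1, kmeans' tot.withinss equals the total sum of squares
# around the column means
set.seed(42)
x <- matrix(rnorm(60), ncol = 3)
km <- kmeans(x, centers = 1)
tss <- sum(scale(x, center = TRUE, scale = FALSE)^2)  # total SS around column means
```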
Let’s group our data by two clusters and show these clusters on the original graphical summary:
kmeans = kmeans(tb_boston %>% mutate_all(~scale(.)[,1]), 2)
tb_boston_kmeans <- tb_boston %>%
mutate_all(~scale(.)[,1]) %>%
mutate(cluster=kmeans$cluster)
ggpairs(
  data = tb_boston_kmeans,
  lower = list(continuous = wrap("points", alpha = 0.3, size = 0.3), mapping = aes(color = factor(cluster)))
)
Here we can see visually which points belong to the two clusters, blue and red. What do the clusters associate with? It seems like blue associates with…
- higher-than-low crime rates
- low amounts of residential land zoned for lots over 25,000 sq ft (few large properties?)
- high amounts of industry
- high concentrations of nitrogen oxides
- somewhat fewer rooms per dwelling
- more older buildings
- closer distance to employment centres
- higher accessibility to highways
- higher tax rate (full-value property-tax rate)
- a lower proportion of black residents
- a higher proportion of lower-status residents
- lower value of homes
Thus, one way to interpret this is that the areas in Boston can be divided into two clusters: one “high-value” cluster with low crime rates, expensive homes, good air quality, little industry, and so on; and one “low-value” cluster with higher crime rates, less expensive homes, worse air quality, and more industry.
Finally, let’s try separating the dataset into 3 clusters and then running an LDA on it:
set.seed(123)
kmeans = kmeans(tb_boston %>% mutate_all(~scale(.)[,1]), 3)
tb_boston_kmeans_3 <- tb_boston %>%
mutate_all(~scale(.)[,1]) %>%
mutate(cluster=kmeans$cluster)
lindis_clustered <- MASS::lda(cluster~.,tb_boston_kmeans_3)
lindis_clustered
## Call:
## lda(cluster ~ ., data = tb_boston_kmeans_3)
##
## Prior probabilities of groups:
## 1 2 3
## 0.2806324 0.3992095 0.3201581
##
## Group means:
## crim zn indus chas nox rm
## 1 0.9693718 -0.4872402 1.074440092 -0.02279455 1.04197430 -0.4146077
## 2 -0.3549295 -0.4039269 0.009294842 0.11748284 0.01531993 -0.2547135
## 3 -0.4071299 0.9307491 -0.953383032 -0.12651054 -0.93243813 0.6810272
## age dis rad tax ptratio black
## 1 0.7666895 -0.8346743 1.5010821 1.4852884 0.73584205 -0.7605477
## 2 0.3096462 -0.2267757 -0.5759279 -0.4964651 -0.09219308 0.2473725
## 3 -1.0581385 1.0143978 -0.5976310 -0.6828704 -0.53004055 0.3582008
## lstat medv
## 1 0.85963373 -0.6874933
## 2 0.09168925 -0.1052456
## 3 -0.86783467 0.7338497
##
## Coefficients of linear discriminants:
## LD1 LD2
## crim 0.03654114 0.20373943
## zn -0.08346821 0.34784463
## indus -0.32262409 -0.12105014
## chas -0.04761479 -0.13327215
## nox -0.13026254 0.15610984
## rm 0.13267423 0.44058946
## age -0.11936644 -0.84880847
## dis 0.23454618 0.58819732
## rad -1.96894437 0.57933028
## tax -1.10861600 0.53984421
## ptratio -0.13087741 -0.02004405
## black 0.15432491 -0.06106305
## lstat -0.14002173 0.14786473
## medv 0.02559139 0.37307811
##
## Proportion of trace:
## LD1 LD2
## 0.8999 0.1001
Here we can see that the linear discriminant analysis separated our data along two dimensions, where LD1 is the most important one. The most influential variables within LD1 seem to be rad (-1.97), tax (-1.11) and indus (-0.32).
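That ranking can also be pulled out programmatically from the `scaling` matrix of the fitted `lda` object. A sketch on the built-in iris data (not the Boston clusters), sorting variables by the absolute size of their LD1 coefficients:

```r
library(MASS)
# A sketch on the built-in iris data (not the Boston clusters): rank the
# variables by the absolute size of their LD1 coefficients, the same way
# rad, tax and indus were picked out above.
fit <- lda(Species ~ ., data = iris)
ld1 <- fit$scaling[, "LD1"]
sort(abs(ld1), decreasing = TRUE)
```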
Let’s have a look at this graphically:
ggord(lindis_clustered, factor(tb_boston_kmeans_3$cluster))
This shows us more or less the same. Cluster 1 in particular seems to separate nicely in the LDA, with taxation and access to highways being the variables that mostly separate these areas from the other ones.
1: Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-3 points)
library(tidyverse)
library(GGally)
library(ggfortify)
tb_human <- read_delim("data/data-human-v2.txt")
# Note: Here, I'm using a version of the human dataset without countries as row names.
color_correlation <- function(data, mapping, method="p", use="pairwise", ...){
# Function by user20650 on Stackoverflow (https://stackoverflow.com/a/53685979)
# for ggpairs function, adding color by correlation
# grab data
x <- eval_data_col(data, mapping$x)
y <- eval_data_col(data, mapping$y)
# calculate correlation
corr <- cor(x, y, method=method, use=use)
# calculate colour based on correlation value
# Here I have set a correlation of minus one to blue,
# zero to white, and one to red
# Change this to suit: possibly extend to add as an argument of `my_fn`
colFn <- colorRampPalette(c("blue", "white", "red"), interpolate ='spline')
fill <- colFn(100)[findInterval(corr, seq(-1, 1, length=100))]
ggally_cor(data = data, mapping = mapping, ...) +
theme_void() +
theme(panel.background = element_rect(fill=fill))
}
tb_human %>%
select(-Country) %>%
pivot_longer(everything()) %>%
group_by(name) %>%
summarise(mean=round(mean(value),2), sd=round(sd(value),2))
## # A tibble: 8 × 3
## name mean sd
## <chr> <dbl> <dbl>
## 1 Ado.Birth 47.2 41.1
## 2 Edu.Exp 13.2 2.84
## 3 Edu2.FM 0.85 0.24
## 4 GNI 17628. 18544.
## 5 Labo.FM 0.71 0.2
## 6 Life.Exp 71.6 8.33
## 7 Mat.Mor 149. 212.
## 8 Parli.F 20.9 11.5
ggpairs(
# Adjusting GNI to be expressed in thousands, to make graph more easy to read
data = tb_human %>% mutate(GNI = GNI / 1000) %>% rename(GNI_K=GNI) %>% select(-Country),
# Coloring by correlation
upper = list(continuous = color_correlation),
lower = list(continuous = wrap("points", alpha = 0.3, size=0.3)),
) +
theme(axis.text = element_text(size=5))
Summaries of the different variables:
Edu2.FM: The mean ratio of females to males with at least some secondary education is 0.85, indicating that in most countries more men than women have secondary education. There is quite a lot of spread here: many countries have even more male-favoring education systems, a very few have female-favoring systems, and quite a few have roughly equal numbers of men and women with secondary education.
Labo.FM: Very few (maybe 4) countries have a female-biased workforce; most are strongly male-biased (the mean ratio is 0.71), and quite a few countries have very low female participation in the workforce.
Edu.Exp: The average expected time of education is 13.18 years; this variable is fairly bell-shaped, with some countries higher and some lower (SD = 2.84 years).
Life.Exp: Average life expectancy is 71.6 years, with a left-skewed distribution (more countries with long lifespans than short ones). However, some countries have very low expected lifespans, going down to roughly 50 years.
GNI: The average gross national income (per capita) is 17 627.90. The distribution here is strongly right-skewed, meaning some countries have vastly higher incomes (up to roughly 125 000 per capita).
Mat.Mor: The mean maternal mortality rate is 149 deaths per 100 000 births, but the distribution here is highly right-skewed, meaning some countries have much higher maternal mortality (up to almost 1 000 deaths per 100 000 births).
Ado.Birth: The mean rate of births among adolescents is 47.16 births per 1 000 women aged 15-19. This measure is also fairly right-skewed, with some countries having much higher adolescent birth rates (up to 200 per 1 000).
Parli.F: The average share of parliament seats held by females is around 20%, with some countries at 0% and some with up to 60% of seats held by female politicians (SD = 11.5%).
Notable correlations:
There is a positive correlation between expected years of education, the F/M ratio in higher education, life expectancy, and GNI. People in high-income countries have longer expected educations, more women in higher education, and longer life expectancies.
On the other hand, rates of maternal mortality and births among adolescents correlate negatively to these measures.
Finally, the share of females in parliament and the share of females in the workforce don't correlate strongly with any of these variables. The correlations they do have, however, are somewhat counterintuitive: countries with high female participation in the workforce also have higher maternal mortality.
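Instead of eyeballing the correlation panel, the strongest pairs can also be listed programmatically. A sketch on the built-in mtcars data (not the human data):

```r
# A sketch on the built-in mtcars data: pull out the strongest pairwise
# correlations programmatically instead of eyeballing the ggpairs panel.
cm <- cor(mtcars)
cm[upper.tri(cm, diag = TRUE)] <- NA   # keep each variable pair only once
tb <- subset(as.data.frame(as.table(cm)), !is.na(Freq))
names(tb) <- c("var1", "var2", "correlation")
head(tb[order(-abs(tb$correlation)), ], 5)  # five strongest correlations
```

The same recipe applied to `tb_human %>% select(-Country)` would rank the relationships described above.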
2: Perform principal component analysis (PCA) on the raw (non-standardized) human data. Show the variability captured by the principal components. Draw a biplot displaying the observations by the first two principal components (PC1 coordinate in x-axis, PC2 coordinate in y-axis), along with arrows representing the original variables. (0-2 points)
First, running the PCA analysis and printing out the principal components:
# Running PCA
pca_human <- prcomp(tb_human %>% select(-Country))
pca_human
## Standard deviations (1, .., p=8):
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## Edu2.FM -5.607472e-06 0.0006713951 -3.412027e-05 -2.736326e-04 -0.0022935252
## Labo.FM 2.331945e-07 -0.0002819357 5.302884e-04 -4.692578e-03 0.0022190154
## Edu.Exp -9.562910e-05 0.0075529759 1.427664e-02 -3.313505e-02 0.1431180282
## Life.Exp -2.815823e-04 0.0283150248 1.294971e-02 -6.752684e-02 0.9865644425
## GNI -9.999832e-01 -0.0057723054 -5.156742e-04 4.932889e-05 -0.0001135863
## Mat.Mor 5.655734e-03 -0.9916320120 1.260302e-01 -6.100534e-03 0.0266373214
## Ado.Birth 1.233961e-03 -0.1255502723 -9.918113e-01 5.301595e-03 0.0188618600
## Parli.F -5.526460e-05 0.0032317269 -7.398331e-03 -9.971232e-01 -0.0716401914
## PC6 PC7 PC8
## Edu2.FM 2.180183e-02 6.998623e-01 7.139410e-01
## Labo.FM 3.264423e-02 7.132267e-01 -7.001533e-01
## Edu.Exp 9.882477e-01 -3.826887e-02 7.776451e-03
## Life.Exp -1.453515e-01 5.380452e-03 2.281723e-03
## GNI -2.711698e-05 -8.075191e-07 -1.176762e-06
## Mat.Mor 1.695203e-03 1.355518e-04 8.371934e-04
## Ado.Birth 1.273198e-02 -8.641234e-05 -1.707885e-04
## Parli.F -2.309896e-02 -2.642548e-03 2.680113e-03
Let’s look at how much variation is explained by different components:
# Displaying variability captured by different principal components
summary(pca_human)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
Note how PC1 seems to capture almost all of the variance, and that GNI seems to be the most important variable in PC1.
Let’s draw the biplot:
autoplot(pca_human, data = tb_human, loadings.label = TRUE, loadings.label.size = 3)
PCA biplot of HDI data (unstandardized). The plot indicates that most of the variation in our dataset is captured by principal component 1, which is strongly associated with gross national income per capita. Principal component 2 is far less important (it explains 0.01% of the variation); here the maternal mortality rate is the most strongly associated variable.
Interesting; we’ll discuss more below.
3: Standardize the variables in the human data and repeat the above analysis. Interpret the results of both analysis (with and without standardizing). Are the results different? Why or why not? Include captions (brief descriptions) in your plots where you describe the results by using not just your variable names, but the actual phenomena they relate to. (0-4 points)
tb_human_st <-
tb_human %>%
mutate_at(vars(-"Country"),
~ scale(.)[,1]
)
pca_human_st <- prcomp(tb_human_st %>% select(-Country))
pca_human_st
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## Edu2.FM -0.35664370 0.03796058 -0.24223089 0.62678110 -0.5983585
## Labo.FM 0.05457785 0.72432726 -0.58428770 0.06199424 0.2625067
## Edu.Exp -0.42766720 0.13940571 -0.07340270 -0.07020294 0.1659678
## Life.Exp -0.44372240 -0.02530473 0.10991305 -0.05834819 0.1628935
## GNI -0.35048295 0.05060876 -0.20168779 -0.72727675 -0.4950306
## Mat.Mor 0.43697098 0.14508727 -0.12522539 -0.25170614 -0.1800657
## Ado.Birth 0.41126010 0.07708468 0.01968243 0.04986763 -0.4672068
## Parli.F -0.08438558 0.65136866 0.72506309 0.01396293 -0.1523699
## PC6 PC7 PC8
## Edu2.FM 0.17713316 0.05773644 0.16459453
## Labo.FM -0.03500707 -0.22729927 -0.07304568
## Edu.Exp -0.38606919 0.77962966 -0.05415984
## Life.Exp -0.42242796 -0.43406432 0.62737008
## GNI 0.11120305 -0.13711838 -0.16961173
## Mat.Mor 0.17370039 0.35380306 0.72193946
## Ado.Birth -0.76056557 -0.06897064 -0.14335186
## Parli.F 0.13749772 0.00568387 -0.02306476
summary(pca_human_st)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
Now this changes things! PC1 is still the most important component, but not by nearly as much as previously.
autoplot(pca_human_st, x=1,y=2, data = tb_human, loadings.label = TRUE, loadings.label.size = 3)
Biplot of HDI data (standardized). Here, the data separates along two axes. PC1 (which explains the most variation, at 53.6%) relates mainly to economic and health-care development (life expectancy, gross national income per capita, expected years of education, maternal mortality rates, adolescent birth rates, but also the proportion of females in higher education). The other axis (PC2) explains less variation (16.24%) and relates to female participation in politics and in the workforce.
The results are obviously very different. The reason is that the variables are now expressed relative to their own variation, not their absolute size. Previously, GNI was the most important variable partly because it held the largest values and the most variation (the same goes for maternal mortality); after rescaling it so that it is expressed in standard deviations, it is analysed on the same basis as the other variables. Thus, after standardizing, we're analysing the dataset based on the relative variation in the selected variables, not the absolute variation.
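As a side note, `prcomp` can do this standardization internally via its `scale.` argument; a quick sketch on the built-in mtcars data confirms it matches scaling the columns by hand:

```r
# A sketch on the built-in mtcars data: prcomp(scale. = TRUE) does the same
# standardization internally as scaling the columns by hand first.
p_manual <- prcomp(scale(mtcars))
p_scaled <- prcomp(mtcars, scale. = TRUE)
all.equal(p_manual$sdev, p_scaled$sdev)  # TRUE
```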
4: Give your personal interpretations of the first two principal component dimensions based on the biplot drawn after PCA on the standardized human data. (0-2 points)
We can see that the data divides along mainly two axes. PC1 seems to include variables related to overall economic and health-care development, such as GNI, life expectancy, maternal mortality rates, and adolescent birth rates. The other axis seems to contain variables related to female participation in the workforce and in politics. It is interesting to note that these two axes separate quite strongly, almost as if economic development and gender equality in the workforce and in politics were separate things. On that note, it is also interesting that gender equality in education is on the same axis as economic development (PC1), not on the "gender-participation" axis (PC2).
5: The tea data comes from the FactoMineR package and it is measured with a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions) and what are their product’s perception (12 questions). In addition, some personal details were asked (4 questions). Explore the data briefly: look at the structure and the dimensions of the data. Use View(tea) to browse its contents, and visualize the data. Use Multiple Correspondence Analysis (MCA) on the tea data (or on just certain columns of the data, it is up to you!). Interpret the results of the MCA and draw at least the variable biplot of the analysis. You can also explore other plotting options for MCA. Comment on the output of the plots. (0-4 points)
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.2.2
tb_tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
The dataset contains 300 observations (people), with columns representing how they answered various questions. Almost all variables are factors with fairly few levels, often only two; age is the only variable expressed as a number. The dataset seems to contain data about how people drink their tea and how often, along with some questions about personal details and about how they feel about tea drinking.
For the multiple correspondence analysis, we're only going to look at the first 18 columns (with age, column 19, as a supplementary quantitative variable), as they seem to be the most important ones for traits related to tea drinking itself.
mca_tea<- FactoMineR::MCA(tb_tea, quanti.sup=19, quali.sup=c(20:36), graph=FALSE)
# visualize MCA
plot(mca_tea, invisible=c("ind","quali.sup","quanti.sup"), graph.type = "classic",cex=0.7)
plot(mca_tea, invisible=c("var","quali.sup","quanti.sup"), graph.type = "classic",cex=0.7)
The first plot shows us how the different answers relate to the two main principal components, while the second plot shows us where different individuals fall along these two dimensions. We can see that the population is fairly homogeneous; there aren't any obvious groupings among the respondents. On the first graph, we can try to make sense of what the two dimensions associate with. On the upper extreme of Dim 1 we see answers such as "tearoom", "chain store+tea shop", "tea bag+unpackaged", "always", "pub", "lunch", and "work", while on the lower extreme we have answers like "not.tea time", "not work", "p_cheap", "not.home", "not.tearoom", etc. One way to interpret this is that dimension one separates frequent tea drinkers from those who drink tea rarely. The second dimension could relate to tea-buying patterns: those on the higher end of Dim 2 like to buy more expensive, unpackaged teas from specialized tea shops, while those on the lower end buy their teas from chain stores, use tea bags, have milk in their tea, and drink Earl Grey.
The RATS dataset contains data from 16 subjects (rats) that were measured on 11 days during a 9-week-long experiment. The rats were divided among 3 feeding treatments, and we're looking to see whether there is a difference in growth rate between the treatments.
First, loading in the data (long form):
library(tidyverse)
tb_RATS <- read_delim(
file = "data/data-rats.txt",
col_types = list(ID = col_factor(), Group = col_factor())
)
tb_RATS %>%
ggplot()+
aes(x=day, y=weight, label=ID, col=Group)+
geom_text()+
labs(title="A) Weight of rats over the study time period", subtitle = "Colors:groups, Numbers: individual ID")
tb_RATS %>%
ggplot()+
aes(x=day, y=weight, col=ID, facet=Group)+
geom_line(size=1,alpha=0.6)+
facet_wrap(~Group) +
labs(title="B) Weight of rats over the study time period", subtitle = "Facets indicate groups 1-3, each line is one rat")
tb_RATS %>%
ggplot()+
aes(x=day,y=weight,col=Group, shape=Group)+
stat_summary(fun.data=mean_se, geom="errorbar", position = position_dodge(2), width=3,size=0.9)+
stat_summary(fun=mean, geom="line", position = position_dodge(2), size=1, alpha=0.2)+
stat_summary(fun=mean, geom="point", position = position_dodge(2), size=3)+
labs(title="C) Mean weight (±SE)of rats in each group (1-3) over the study time period",
subtitle = "Points are dodged slightly to avoid overlap",
y="Mean weight (±SE)")
Here I've used a few different plotting approaches to look at how the weight of the subjects (rats) changes over the study period.
A) Looks a little messy: here the weight of each individual is plotted against time, and the individuals are identified by their numerical ID. This gives us an idea of what the data look like, but it's hard to tell some individuals apart.
B) Here, I've used line plots to connect the measurements of each individual so we can more easily see the changes over time. The three facets (subplots) are for the three feed groups. We can clearly see there is some difference between the groups, we can track each individual, and we can see that in group 2 there is one individual that is much larger than the others in that group.
C) Here, we're using summarized data for each day and treatment. We can see the mean weight of each group on each measurement day, with standard errors to give us an idea of how different the groups are. This hides the original data points, but showing them would also make the plot very messy. In any case, we can see that the weights increase over time in all groups. However, it's hard to tell whether there is a difference in growth rate, which is what we're interested in finding out from these data.
tb_RATS %>%
ggplot()+
aes(x=Group, y=weight)+
geom_boxplot()+
labs(title="A) Mean weights of rats in each treatment group, for the entire study period")
tb_RATS %>%
ggplot()+
aes(x=Group, y=weight)+
geom_violin(alpha=0.3)+
geom_jitter(aes(col=ID))+
labs(title="B) Distributions of weights of rats in each treatment group (violins)",
subtitle="Points indicate single measures of individuals, colors indicate individual")
Here are some more summary plots, similar to those used in Chapter 8 of the MABS book. Each plot simply shows the weights recorded in each treatment (over the entire study period). These plots really are not that useful for figuring out whether there is a difference in growth rate between the treatments, but they do show the absolute weight differences.
One interesting thing to note: in plot A it looks like we have some outliers (group 2). I decided to investigate this further and made an alternative plot that shows the original data points (with violin plots in the background showing their distribution). Here we can see that while some points look like outliers when all measurements are pooled, within their individual they are really not outliers at all but fall within a reasonable range for the purple individual in group 2. I think this underscores the importance of accounting for individual ID when working with this kind of data. Another way to make this plot would be to use each individual's mean weight over the study as a single data point instead of every measurement; still, that's throwing away good data!
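For reference, the individual-means alternative just mentioned would look something like this (a sketch on a small simulated stand-in for the RATS data, since the actual file isn't reloaded here):

```r
library(dplyr)
# A sketch on a small simulated stand-in for the RATS data: collapsing each
# individual to its mean weight over the study, as described above.
tb <- tibble(
  ID     = factor(rep(1:4, each = 3)),
  Group  = factor(rep(c(1, 1, 2, 2), each = 3)),
  weight = c(250, 255, 260, 240, 246, 252, 450, 460, 470, 430, 442, 455)
)
tb_means <- tb %>%
  group_by(ID, Group) %>%
  summarise(mean_weight = mean(weight), .groups = "drop")
tb_means  # one row (and one data point) per rat
```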
tb_RATS %>%
group_by(ID,Group) %>%
summarise(slope = lm(weight~day)$coefficients[2]) %>%
ggplot()+
aes(x=Group,y=slope)+
geom_violin(alpha=0.5, color=NA, fill="gray")+
geom_jitter(width=0.1,height=0)+
stat_summary(fun.data=mean_se, geom="errorbar",width=0.3,size=1,alpha=0.3)+
stat_summary(fun=mean, geom="point",size=5,alpha=0.3)+
labs(
title="Difference in weight gain for different groups of rats",
subtitle="Large points and errorbars indicate group means ± SE",
y="Estimated weight gain (g/day)"
)
The previous graphs were not very useful for telling whether there is a difference in growth rate between the groups; we need some measure of growth rate. One way would be to take the difference between the first and last data points and use that as the growth rate, but then we'd be throwing away all the data between the first and last point; not very exciting.
Really, the way to do this would be to use a linear mixed-effects model, but I'm not supposed to in this part of the assignment, so I came up with another "hacky" solution. In the graph above, I have fitted a small linear model to each individual's growth data and extracted the slope from that model; thus, each data point in the graph is the "slope" for one individual. We can then compare the slopes between the three groups by looking at the means and standard errors (though I've also plotted the original data points and their distributions). What we see is that group 1 definitively looks to have a lower growth rate than groups 2 and 3, but I wouldn't be so sure about the difference between groups 2 and 3.
Keep in mind that this is not really a good solution, though, because we're essentially doing analysis on parameters (the slopes we calculated); having calculated the means and standard errors of these, we're essentially calculating parameters of parameters! We'll most certainly be violating some assumptions when we put these slope data into any statistical test.
…so let’s do that next!
model_fit <-
tb_RATS %>%
group_by(ID,Group) %>%
summarise(slope = lm(weight~day)$coefficients[2]) %>%
lm(slope ~ Group, data=.)
anova(model_fit)
## Analysis of Variance Table
##
## Response: slope
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 1.00665 0.50332 7.5743 0.006594 **
## Residuals 13 0.86387 0.06645
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In Chapter 8 of the MABS book, they do both a t-test and an ANOVA. However, we're dealing with three groups here (not two), so a t-test is out of the question (we could do multiple t-tests, but then we'd also have to correct for multiple testing). Instead, I'm only doing an ANOVA on the slopes I calculated. The summary tells us that there is a significant difference in slopes between the groups, but it can't tell us which group differs from which. This is about as far as we get without turning to proper linear mixed models.
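A natural follow-up (not required by the assignment) would be a post-hoc Tukey HSD test on the same slopes, which does all pairwise group comparisons with the multiple-testing correction built in. A sketch on simulated per-individual slopes, not the actual RATS slopes:

```r
# A sketch on simulated per-individual slopes (not the actual RATS slopes):
# Tukey's HSD does all pairwise group comparisons with the multiple-testing
# correction built in.
set.seed(1)
slopes <- data.frame(
  Group = factor(rep(c("1", "2", "3"), each = 6)),
  slope = c(rnorm(6, mean = 0.5, sd = 0.1),
            rnorm(6, mean = 1.0, sd = 0.1),
            rnorm(6, mean = 1.1, sd = 0.1))
)
fit <- aov(slope ~ Group, data = slopes)
TukeyHSD(fit)  # one adjusted p-value per group pair
```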
This dataset contains data for 40 subjects who were given BPRS scores (a psychological rating scale indicating schizophrenia symptoms) over 8 weeks and were put in one of two treatment groups.
Loading in the data:
library(lmerTest)
tb_BPRS <- read_delim(
file = "data/data-BPRS.txt",
col_types = list(subject = col_factor(), treatment = col_factor())
)
tb_BPRS %>%
ggplot()+
aes(x=week, y=BPRS, col=subject, facet=treatment)+
geom_line()+
facet_wrap(~treatment)+
labs(title="A) BPRS scores of subjects over the study time period")
tb_BPRS %>%
group_by(week) %>%
mutate(BPRS_scaled = scale(BPRS)) %>%
ggplot()+
aes(x=week, y=BPRS_scaled, col=subject, facet=treatment)+
geom_line()+
facet_wrap(~treatment)+
labs(title="B) Scaled BPRS scores of subjects over the study time period")
The two plots above show how the BPRS scores of all participating individuals developed over the 8 weeks. Plot B) is supposed to be easier to read because the scores are scaled (within each week), but I honestly don't find it much more helpful than the first one. In any case, it seems that in both treatments the BPRS scores go down over time, but we can't really tell by how much or whether there is a difference between the groups.
To tell that, we’ll need to do some modelling.
Note: To save space, for the following summaries I’m only printing the model coefficients.
As in MABS Ch9 I’ll fit some linear models (and linear mixed models) of increasing complexity, and then I’ll compare them in the end.
First, fitting a completely ordinary linear model:
mod_1_normal <- lm( BPRS ~ week + treatment, data=tb_BPRS)
summary(mod_1_normal)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4538889 1.3670163 33.9819579 6.256351e-114
## week -2.2704167 0.2524021 -8.9952367 1.412930e-17
## treatment2 0.5722222 1.3033989 0.4390231 6.609104e-01
According to this first model, the BPRS scores go down significantly over time (by 2.27 points per week), but the effect of the treatments is more uncertain (treatment 2 might give larger BPRS scores than treatment 1, but the difference is not significant, so we can't tell).
Next, let’s try fitting a mixed effect (ME) model, letting subject ID act as a random effect on the intercept:
mod_2_rInt <- lmer( BPRS ~ week + treatment + (1|subject), data=tb_BPRS)
summary(mod_2_rInt)$coefficients
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 46.4538889 2.4095150 43.20949 19.2793521 7.596500e-23
## week -2.2704167 0.1505526 319.00000 -15.0805494 3.605128e-39
## treatment2 0.5722222 3.2994257 38.00000 0.1734309 8.632333e-01
Using this model gives us a very similar result. The significance of the effect of time increased, but we’re still uncertain about the effect of treatment 2.
Next: increasing the complexity. This is the same model as above, but now subjects also have a random effect on the slope of time (week), in addition to the random intercept.
mod_3_rInt_rSlope <- lmer( BPRS ~ week + treatment + (week|subject), data=tb_BPRS)
summary(mod_3_rInt_rSlope)$coefficients
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 45.982973 2.7054200 47.78312 16.9966118 6.816057e-22
## week -2.270417 0.2747198 38.99793 -8.2644820 4.216588e-10
## treatment2 1.514053 3.2207855 37.99912 0.4700881 6.409789e-01
Still a similar result. The effect size of the treatment is higher now, yet still non-significant.
Finally, same model as above, but now with an interaction fitted between week and treatment:
mod_4_rInt_rSlope_intractn <- lmer( BPRS ~ week * treatment + (week|subject), data=tb_BPRS)
summary(mod_4_rInt_rSlope_intractn)$coefficients
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 47.8855556 3.0614637 38.00344 15.6413926 3.802479e-18
## week -2.6283333 0.3849267 38.00052 -6.8281404 4.204415e-08
## treatment2 -2.2911111 4.3295635 38.00344 -0.5291783 5.997594e-01
## week:treatment2 0.7158333 0.5443685 38.00052 1.3149793 1.963973e-01
The effect of time (week) is still very similar here (and significant), but the treatment effect has changed a lot! Treatment 2 now gives a lower BPRS score (-2.29) than treatment 1. Also, the effect of time (week) now differs between the treatments: in treatment 1, subjects go down in BPRS faster than in group 2. However, both the treatment effect and the interaction are still non-significant here.
Let’s compare the models using an ANOVA analysis:
anova(
mod_2_rInt,
mod_3_rInt_rSlope,
mod_4_rInt_rSlope_intractn
)
## Data: tb_BPRS
## Models:
## mod_2_rInt: BPRS ~ week + treatment + (1 | subject)
## mod_3_rInt_rSlope: BPRS ~ week + treatment + (week | subject)
## mod_4_rInt_rSlope_intractn: BPRS ~ week * treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df
## mod_2_rInt 5 2582.9 2602.3 -1286.5 2572.9
## mod_3_rInt_rSlope 7 2523.2 2550.4 -1254.6 2509.2 63.663 2
## mod_4_rInt_rSlope_intractn 8 2523.5 2554.6 -1253.7 2507.5 1.780 1
## Pr(>Chisq)
## mod_2_rInt
## mod_3_rInt_rSlope 1.499e-14 ***
## mod_4_rInt_rSlope_intractn 0.1821
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The analysis tells us that model 3 (including ID as a random effect on time, but not including the interaction) has a significantly higher likelihood (the probability of observing our data if that model were true) than model 2, while adding the interaction (model 4) does not significantly improve the likelihood further. This indicates model 3 could be the best model.
Let's also try doing some cross validation to compare the models. The following code splits the dataset into two halves (even/odd subject IDs), fits each model separately to each half (giving two fits), and uses the parameters from each fit to predict the data not used in fitting. It then does this procedure for each model:
set.seed(1337)
library(cvms)
# Cross validation (split based on batch number)
cvms::cross_validate(
data = tb_BPRS %>%
mutate(
id_numeric = as.numeric(substr(subject,2,4)),
.folds=factor(ifelse((id_numeric %% 2) == 0,yes=1,no=2))
),
models = c(
"BPRS ~ week + treatment",
"BPRS ~ week + treatment + (1|subject)",
"BPRS ~ week + treatment + (week|subject)",
"BPRS ~ week * treatment + (week|subject)"
),
family="gaussian",
REML=T
) %>% select(Fixed,Random,RMSE,AICc)
## # A tibble: 4 × 4
## Fixed Random RMSE AICc
## <chr> <chr> <dbl> <dbl>
## 1 week+treatment <NA> 13.8 1396.
## 2 week+treatment (1|subject) 13.8 1272.
## 3 week+treatment (week|subject) 13.9 1245.
## 4 week*treatment (week|subject) 13.8 1245.
The cross validation does not quite agree with the ANOVA, but it's close. The root mean square error (RMSE, a measure of how close the predictions were to the observed data; smaller is better) is marginally lower for model 4 than for model 3, so model 4 does marginally better on cross validation. The AICc (not related to cross validation, but a measure of model fit that compensates for complexity) is essentially tied between models 3 and 4. In any case, the differences are very, very small.
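For clarity, RMSE is just the square root of the mean squared difference between observations and predictions. A minimal sketch on the built-in cars data rather than the BPRS models:

```r
# A sketch on the built-in cars data: RMSE is just the square root of the
# mean squared difference between observations and predictions.
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))
fit <- lm(dist ~ speed, data = cars)
rmse(cars$dist, fitted(fit))
```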
The safest thing to conclude is probably that time in treatment had an effect on BPRS scores (models 3 and 4 both agree on this), and that there may be a difference between the treatments in how the subjects develop over time, but this effect is uncertain (model 4 estimates a week-by-treatment interaction, but it is not significant).
Finally, let’s plot the original data vs the predicted data from model 4:
tb_BPRS$fitted = F
tb_BPRS_fitted <-
tb_BPRS %>%
mutate(BPRS = fitted(mod_4_rInt_rSlope_intractn), fitted=T)
ggplot(
bind_rows(tb_BPRS,tb_BPRS_fitted)
)+
aes(x=week,y=BPRS,col=subject,facet=fitted)+
geom_line()+
facet_wrap(~interaction(treatment,fitted))
The plot above shows the observed data for groups 1 and 2 (top row) and the predicted data for the same groups (bottom row). The main observation here is that the predicted data look fairly similar to the observed data, meaning our model is not completely off.
Thanks for grading this assignment, and happy holidays!!!